LASSO, which stands for Least Absolute Shrinkage and Selection Operator, is a model complexity control technique, alongside variable selection and ridge regression. In this notebook we'll demonstrate how to use the glmnet package to fit a LASSO regression. For more information about LASSO, you can refer to the LASSO Page.
This notebook is aimed at data scientists who understand linear regression and want to learn how to fit a LASSO regression in R. An operationalization step is also included to show how you can deploy a web service in Azure based on the selected model.
In this example, we'll use the housing data from the R package MASS. The dataset has 506 rows and 14 columns; available information includes median home price, average number of rooms per dwelling, crime rate by town, and so on. More information about this dataset can be found at UCI or by typing help(Boston) in an R terminal.
In [ ]:
library(MASS) # to use the Boston dataset
?Boston
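As a quick, optional check of the dimensions quoted above, you can inspect the data frame directly:
In [ ]:
dim(Boston)   # 506 rows, 14 columns
head(Boston)  # preview the first few observations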
For illustration purposes, we'll use "medv" - median home price - as the response variable and the remaining variables as predictors.
The first step in fitting a LASSO regression is to determine the value of the tuning parameter λ, which controls the overall strength of the penalty. Here we'll use cross-validation to choose the λ that gives the smallest cross-validated error.
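For reference, with family = "gaussian" and alpha = 1, glmnet estimates the coefficients by minimizing the penalized residual sum of squares

$$\min_{\beta_0,\ \beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert,$$

where larger values of λ shrink more coefficients exactly to zero.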
In [ ]:
# to make results replicable
set.seed(123)
# load libraries
if(!require("glmnet", quietly = TRUE)) install.packages("glmnet")
library(glmnet) # to fit a LASSO model
library(MASS) # to use the Boston dataset
In [ ]:
# define response variable and predictor variables
response_column <- which(colnames(Boston) %in% c("medv"))
train_X <- data.matrix(Boston[, -response_column])
train_y <- Boston[,response_column]
# use cv.glmnet with 10-fold cross-validation to pick lambda
model1 <- cv.glmnet(x = train_X,
                    y = train_y,
                    alpha = 1,           # alpha = 1 corresponds to the LASSO penalty
                    nfolds = 10,
                    family = "gaussian",
                    type.measure = "mse")
In [ ]:
summary(model1)
options(repr.plot.width = 5, repr.plot.height = 5)
plot(model1)
In the plot above, the red dots show the mean cross-validated error for each value of λ (on the log scale), with error bars of one standard error. The numbers along the top axis indicate how many variables have nonzero coefficients at each λ, and the two vertical dotted lines mark lambda.min and lambda.1se.
Lambda that gives minimum mean cross-validated error:
In [ ]:
round(model1$lambda.min, 4)
Largest lambda with mean cross-validated error within 1 standard error of the minimum error:
In [ ]:
round(model1$lambda.1se, 4)
Coefficients based on lambda that gives minimum mean cross-validated error:
In [ ]:
coef(model1, s = "lambda.min")
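The cross-validated error at these two values of λ can also be read directly from the fitted object. The snippet below is a small optional check using the lambda and cvm components documented in ?cv.glmnet:
In [ ]:
# cross-validated MSE at lambda.min and lambda.1se
data.frame(
    lambda = c(model1$lambda.min, model1$lambda.1se),
    cv_mse = model1$cvm[match(c(model1$lambda.min, model1$lambda.1se), model1$lambda)]
)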
While the model selected by cv.glmnet() can be used for making predictions, we also want to better understand how the value of λ affects the estimated coefficients. Such information can be produced by the glmnet() function, which fits the model over a whole sequence of λ values. In the plot generated below, you can see that the coefficients shrink toward zero as λ increases.
In [ ]:
model2 <- glmnet(x = train_X,
                 y = train_y,
                 alpha = 1,
                 family = "gaussian")

# variable names and their column indices (used as plot labels)
vn  <- colnames(train_X)
vid <- as.character(seq(1, length(vn)))

# coefficients at the last (smallest) lambda on the path, without the intercept;
# exclude the variables whose coefficient is exactly 0
vnat   <- coef(model2)
vnat_f <- vnat[-1, ncol(vnat)]
vid <- vid[vnat_f != 0]
vn  <- vn[vnat_f != 0]

# define the legend description, line type, and line color
nvars <- length(vn)
legend_desc <- paste(vid, vn, sep = ": ")
legend_desc
mylty <- rep(1, nvars)
mycl  <- seq(1, nvars)
By increasing the value of lambda, the regression parameters are increasingly penalised and thus move closer to zero.
In the lambda plot below, you can see how the coefficients gradually shrink in value as lambda increases. This has a particularly strong impact on variable 5 (nox, nitrogen oxides concentration in parts per 10 million).
This illustrates one of the strengths of LASSO regression: the algorithm deals well with problematic predictors, for example situations where the predictors are highly correlated with one another (multicollinearity). A quick correlation check after the plot below illustrates this.
In [ ]:
# plot
options(repr.plot.width = 6, repr.plot.height = 6)
plot(model2, xvar = "lambda", label = TRUE, col = mycl, xlim = c(-5.5, 2))
legend(-0.5,-2, legend_desc, lty = mylty, col = mycl, cex = 0.8)
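For that quick check on the multicollinearity point, you can look at the pairwise correlations among the predictors; nox, in particular, is strongly correlated with several of the other variables. This step is optional and not part of the original workflow:
In [ ]:
# pairwise correlations among the predictors (rounded for readability)
round(cor(train_X), 2)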
The coefficients from this model at the optimal λ are also printed below. As we would expect, they are the same as those obtained from cv.glmnet().
In [ ]:
coef(model2, s = model1$lambda.min)
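If you want to verify this programmatically, the two sets of coefficients can be placed side by side; this is a quick optional check, not part of the original workflow:
In [ ]:
# compare the coefficients from the cross-validated fit and the full-path fit
cbind(cv_fit   = as.vector(coef(model1, s = "lambda.min")),
      path_fit = as.vector(coef(model2, s = model1$lambda.min)))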
To make predictions, you can use either of the two models:
In [ ]:
x_new <- train_X  # the predictor matrix already excludes the response column medv
pred1 <- predict(model1, newx = x_new, s = "lambda.min")
pred2 <- predict(model2, newx = x_new, s = model1$lambda.min)
head(
data.frame(actual = Boston$medv, model1 = as.vector(pred1), model2 = as.vector(pred2))
)
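To get a rough sense of the in-sample fit, you can also compute the root mean squared error of these predictions on the training data; keep in mind this is an optimistic estimate, since it is evaluated on the same data used to fit the model:
In [ ]:
# in-sample RMSE of the predictions at lambda.min
sqrt(mean((Boston$medv - as.vector(pred1))^2))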
In [ ]:
# define predict function
mypredict <- function(newdata){
    # drop the response column if it is present in the new data
    if("medv" %in% names(newdata)) {
        w <- match("medv", names(newdata))
        newdata <- newdata[, -w]
    }
    require(glmnet)
    newdata <- data.matrix(newdata) # the prediction data needs to be a matrix for glmnet
    predict(model2, newx = newdata, s = model1$lambda.min)
}
# test the prediction function
newdata <- Boston[1:10, ]
mypredict(newdata)
In [ ]:
# load the library
library(AzureML)
# workspace information; with no arguments, the AzureML package reads the
# workspace ID and authorization token from its settings file
ws <- workspace()
# deploy the service
ep <- publishWebService(ws = ws, fun = mypredict,
                        name = "LASSOPrediction",
                        inputSchema = newdata,
                        outputSchema = list(ans = "numeric"))
str(ep)
In [ ]:
newdata <- Boston[1:10, ]
pred <- consume(ep, newdata)$ans
data.frame(actual = newdata$medv, prediction = pred)
Using the Boston housing dataset, we demonstrated how to carry out a LASSO regression analysis. We started the analysis by determining the optimal value of the tuning parameter λ using cross-validation, then examined the impact of λ on the coefficient estimates, and finally deployed a web service based on the selected model.
Created by a Microsoft Employee.
Copyright (C) Microsoft. All Rights Reserved.